1. INITIAL COMMENTS

Closure-based multiple testing procedures for con- 
trolling the familywise error rate (FWER) have been 
around for decades, but they have not been well un- 
derstood, and hence have been under-appreciated 
and under-utilized. Goeman and Solari (GS) provide 
a service by highlighting important practical fea- 
tures of closure. Using elegant notation for closure- 
based methods, they develop a handy book-keeping 
tool for presenting additional results of closed test- 
ing that are available when non-consonant testing 
methods are used, and they prove its validity. 

In their Figure 1, GS provide the confidence set
$\tau(\{2,3\}) \in \{0,1\}$, where $\tau(\{2,3\})$ is the number of
true nulls in the set $\{H_2, H_3\}$. In doing so, GS highlight a
not-so-well-known fact about closure: inferences for the additional
$(2^n - 1) - n$ composite hypotheses $H_I$ are available "free of
charge" whenever one performs closed testing for the original $n$
elementary hypotheses $H_i$. This follows from the fact that "the
closure of the closure is the closure;" that is, that no new
hypotheses are generated when the set of $2^n - 1$ intersection
hypotheses is treated as the set of elementary hypotheses. Hence, in
GS's Figure 1, the significance of $H_{\{2,3\}}$ can be stated with
full FWER control over the set of $2^3 - 1 = 7$ hypotheses, and the
conclusion $\tau(\{2,3\}) \le 1$ follows immediately. Again, GS
provide a service in reminding statisticians (or in teaching those
who have not heard about it in the first place) of this nice feature
of closure.
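
The bound on the number of true nulls can be sketched in a few lines of Python. This is a minimal illustration, not GS's implementation: the function names and the local p-values are hypothetical (they are not the values in GS's Figure 1), and the upper bound is taken to be the size of the largest subset of the chosen set whose intersection hypothesis is not rejected by closure.

```python
from itertools import combinations


def nonempty_subsets(indices):
    """All nonempty subsets of an index collection, as frozensets."""
    items = sorted(indices)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def closed_rejections(local_p, alpha):
    """Index sets I with H_I rejected by closure: every H_J with J a
    superset of I must have local p-value <= alpha."""
    return {I for I in local_p
            if all(p_J <= alpha for J, p_J in local_p.items() if I <= J)}


def max_true_nulls(R, rejected):
    """Size of the largest I inside R whose H_I is NOT rejected
    (0 if every such H_I is rejected)."""
    sizes = [len(I) for I in nonempty_subsets(R) if I not in rejected]
    return max(sizes, default=0)


# Hypothetical local p-values for n = 3 (illustration only).
HYPOTHETICAL_LOCAL_P = {
    frozenset({1}): 0.50,
    frozenset({2}): 0.06,
    frozenset({3}): 0.07,
    frozenset({1, 2}): 0.20,
    frozenset({1, 3}): 0.30,
    frozenset({2, 3}): 0.02,
    frozenset({1, 2, 3}): 0.01,
}

rejected = closed_rejections(HYPOTHETICAL_LOCAL_P, alpha=0.05)
print(sorted(sorted(I) for I in rejected))   # rejected index sets
print(max_true_nulls({2, 3}, rejected))      # upper bound for the set {H_2, H_3}
```

With these made-up inputs neither elementary hypothesis in {H_2, H_3} is rejected, yet the intersection is, so the bound on the number of true nulls in that set drops from 2 to 1.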

GS's paper also implicitly explains the following
paradox: while closure is based on composite hy- 
potheses, it is not true that more powerful compos- 
ite tests lead to more powerful closure-based mul- 
tiple tests. When considering only the elementary 
hypotheses, Bonferroni (or MaxT) types of compos- 
ite tests, which are usually thought to be the least 
powerful of the class of composite testing methods 
(e.g., Nakagawa, 2004), tend to give higher power for 
closure-based multiple tests (Romano, Shaikh and 
Wolf, 2011). However, when the goal is to establish 
how many true effects there might be among a col- 
lection of hypotheses, GS suggest indeed that more 
powerful composite tests lead to more powerful mul- 
tiple tests. 

The Fisher combination test is a useful choice of composite test, as
noted by GS. But it is worth pointing out how bad this test can be
compared to the Bonferroni test, when both are used via closure for
testing elementary hypotheses. Consider analyzing a version
(available from the author) of the classic dataset reported by Golub
et al. (1999), testing 7,129 genes for association with either acute
myeloid or acute lymphoblastic leukemia, using 7,129 two-sample
t-tests. The closed Fisher combination method has been available in
PROC MULTTEST of SAS/STAT, with the $O(n^2)$ shortcut, since release
8.1 of SAS in 2000; this software computes closure-based adjusted
p-values (defined below) to assess significance of elementary
hypotheses. Despite the fact that the Fisher combination test is
liberal with correlated data, the smallest adjusted p-value using the
closed Fisher combination test is 1.000 (rounded), hence none of the
7,129 tests is significant at any reasonable nominal FWER level. On
the other hand, 37 of the 7,129 genes have adjusted p-values less
than the nominal 0.05 FWER level when using closed Bonferroni (or
Holm, 1979) tests; the smallest adjusted p-value is
$1.7 \times 10^{-6}$ and is therefore extremely significant, even
after multiplicity adjustment.

I have some other comments/critiques about the paper that fall into
the following categories: (i) the assumption of free combinations and
its consequences, (ii) use of adjusted p-values rather than rigid
nominal thresholds, (iii) computational shortcuts, and
(iv) permutation testing.

Fig. 1. Closed testing with Fisher combination tests in a one-way ANOVA setting, ignoring logical constraints.

2. ADDITIONAL COMMENTS 

2.1 Free Versus Restricted Combinations 

Implicit in GS's discussion of closure is that the elementary
hypotheses obey the free combinations condition, which states that
there are $2^n - 1$ distinct hypotheses in the closure. Under
restricted combinations there are duplicates, and hence the set of
intersections has many fewer elements; by exploiting this fact one
can obtain tighter confidence sets. For example, suppose
$Y_i \sim N(\mu_i, 1)$, with $H_1: \mu_1 = \mu_2$,
$H_2: \mu_1 = \mu_3$ and $H_3: \mu_2 = \mu_3$. Then there are only
four elements in the closure rather than $2^3 - 1 = 7$, since H [...]
method is valid but conservative when all seven hypotheses are
considered.

For example, suppose the data are $y_1 = -2$, $y_2 = 0$ and
$y_3 = +2$, yielding z-statistics
$z_1 = (-2 - 0)/2^{1/2} = -1.414$, $z_2 = -2.828$ and
$z_3 = -1.414$, with corresponding two-sided p-values
$p_1 = 0.157299$, $p_2 = 0.004678$ and $p_3 = 0.157299$. The Fisher
combination statistics are thus
$c_{12} = c_{23} = -2 \ln(0.157299 \times 0.004678) = 14.4291$,
$c_{13} = 7.3984$ and $c_{123} = 18.1283$. The chi-squared
distribution cannot be used to find p-values for these composite
tests since the Z's are not independent, but under the null
hypothesis, the vector of Z statistics is multivariate normal [...].
Thus, the p-values can be obtained by simulating Z's from this
distribution, computing the two-sided p-values, constructing the
Fisher combination statistics $C_I$, and counting how often the
simulated $C_I$ exceeds the observed $c_I$. Figure 1 displays the
results using these p-values for each subset $I$, as well as
closure-based adjusted p-values.
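
The simulation just described can be sketched as follows. This is only an illustrative sketch under stated assumptions: rather than writing down the null covariance matrix of the Z's, it simulates independent standard normal Y's and differences them, which induces the same null distribution for the Z vector; the function names, the number of simulations and the seed are all hypothetical choices.

```python
import itertools
import math

import numpy as np

# Elementwise two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
_erfc = np.vectorize(math.erfc, otypes=[float])


def two_sided_p(z):
    return _erfc(np.abs(z) / math.sqrt(2.0))


def pairwise_z(y):
    """z-statistics for H1: mu1 = mu2, H2: mu1 = mu3, H3: mu2 = mu3
    (unit variances, one observation per group)."""
    y = np.asarray(y, dtype=float)
    diffs = [y[..., 0] - y[..., 1], y[..., 0] - y[..., 2], y[..., 1] - y[..., 2]]
    return np.stack(diffs, axis=-1) / math.sqrt(2.0)


def fisher_stats(p):
    """Fisher statistic -2 * sum(log p) for every nonempty subset of the
    three hypotheses; keys are 0-based index tuples, e.g. (0, 2) for {H1, H3}."""
    logp = np.log(p)
    subsets = [s for r in (1, 2, 3) for s in itertools.combinations(range(3), r)]
    return {s: -2.0 * logp[..., list(s)].sum(axis=-1) for s in subsets}


def monte_carlo_local_p(y_obs, n_sim=100_000, seed=0):
    """Observed Fisher statistics and simulated local p-values per subset."""
    rng = np.random.default_rng(seed)
    obs = fisher_stats(two_sided_p(pairwise_z(y_obs)))
    # Assumption: all means equal under the null, so simulate Y ~ N(0, I).
    sim = fisher_stats(two_sided_p(pairwise_z(rng.standard_normal((n_sim, 3)))))
    p = {s: float(np.mean(sim[s] >= obs[s])) for s in obs}
    return obs, p


obs, local_p = monte_carlo_local_p([-2.0, 0.0, 2.0])
for subset in obs:
    print(subset, round(float(obs[subset]), 4), local_p[subset])
```

The observed statistics printed for the two- and three-hypothesis subsets should agree with the values of $c_{12}$, $c_{13}$ and $c_{123}$ quoted in the text; the simulated p-values carry Monte Carlo error that shrinks as the number of simulations grows.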

Suppose that inference is considered for the set $\{H_1, H_3\}$.
Here, the confidence set for the number of true nulls is
$\{0, 1, 2\}$, since $H_{13}$ is not rejected. But the possibility
that $\tau(\{1,3\}) = 2$ contradicts the rejection of the global
hypothesis $H_{123}$, and thus seems wrong.

Incorporating logical constraints, the graph is as shown in Figure 2.
Using logical constraints, the confidence set for $\tau(\{1,3\})$ is
$\{0, 1\}$ rather than $\{0, 1, 2\}$.

One can improve the power of closure-based con- 
sonant procedures as well by utilizing logical con- 
straints (Westfall and Tobias, 2007). 

Fig. 2. Closed testing with Fisher combination tests in a one-way ANOVA setting, incorporating logical constraints.



DISCUSSION 

2.2 Adjusted p-Values

Adjusted p-values are simple and natural by-products of closure. Let
$p_I$ be the local p-value for testing $H_I$. With closure, $H_I$ is
rejected only when $H_J$ is rejected for all $J \supseteq I$, or
equivalently, when $\max_{J \supseteq I} p_J \le \alpha$, where
$\alpha$ is the nominal FWER. Hence $\max_{J \supseteq I} p_J$ is the
adjusted p-value for testing $H_I$, and these are shown in my
Figure 1.
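
As a minimal sketch of this definition, the adjusted p-value for each index set is the largest local p-value over its supersets. The function name and the local p-values below are hypothetical, and the sketch assumes a local p-value is supplied for every nonempty index set (free combinations).

```python
def closure_adjusted_p(local_p):
    """Adjusted p-value for H_I: the maximum local p-value over all J
    that contain I (including I itself)."""
    return {I: max(p_J for J, p_J in local_p.items() if I <= J)
            for I in local_p}


# Hypothetical local p-values for n = 3 (illustration only).
EXAMPLE_LOCAL_P = {
    frozenset({1}): 0.50,
    frozenset({2}): 0.06,
    frozenset({3}): 0.07,
    frozenset({1, 2}): 0.20,
    frozenset({1, 3}): 0.30,
    frozenset({2, 3}): 0.02,
    frozenset({1, 2, 3}): 0.01,
}

adjusted = closure_adjusted_p(EXAMPLE_LOCAL_P)
for I in sorted(adjusted, key=lambda s: (len(s), sorted(s))):
    print(sorted(I), adjusted[I])
```

Rejecting $H_I$ whenever its adjusted p-value is at most $\alpha$ reproduces the closed testing decision at that level, so a single table of adjusted p-values serves every choice of nominal FWER.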

As GS note, exploratory inference should be mild, flexible and post
hoc. However, the use of a strict 0.05 (or other) nominal FWER
threshold seems to violate the latter two of these criteria. For the
same reasons that ordinary p-values are seen as more natural and
useful than the 0.05-level determined "accept/reject" decision, it is
also more natural and useful to report an adjusted p-value along with
any claim about the number of true alternatives within a set of
hypotheses.

For example, suppose my Figure 1 was from a case of free
combinations, as with GS's Figure 1. Then for the set
$\{H_1, H_3\}$, one cannot claim any alternatives at the usual 0.05
nominal FWER level, but one can conclude at least one alternative at
the nominal 0.13 level. The report could state "For familywise
significance levels as low as 0.125, there is at least one
alternative among $\{H_1, H_3\}$."

In GS's discussion of Huang and Hsu's n = 4 ex- 
ample where there are no elementary significances, 
their conclusion is "at least two out of the first three 
hypotheses are false." After calculating the adjusted 
p-values for these data, one can say "at least two 
out of the first three hypotheses are false (adjusted 
p = 0.038)." With other data, the conclusion might 
be that "at least two out of the first three hypotheses 
are false (adjusted p = 0.001)," which communicates 
quite different information, even though the claimed 
number of alternatives is the same at the nominal 
FWER = 0.05 level. 

Yet another benefit of adjusted p-values is that they offer a more
realistic assessment in the face of violated assumptions. Assumptions
are usually wrong, and an adjusted p-value of 0.055 might be more
appropriately reported as 0.041 with a more correct analysis;
conversely, 0.045 might be more appropriately reported as 0.053. Use
of adjusted p-values rather than fixed decisions better recognizes
this fact, as savvy readers understand that p-values are themselves
approximations, and can use their own knowledge or simulation studies
to assess the accuracy of a "0.045" report.

A disadvantage of using adjusted p-values rather than
"accept/reject" decisions is that there are additional computations.
But this disadvantage seems minor to me compared to problems with
rigidly fixed nominal FWER levels.

2.3 Computational Shortcuts 

The methodology GS espouse can be computationally prohibitive. While
closure allows a simple $O(n)$ shortcut in the case of the consonant
Bonferroni-Holm procedure, the GS methods will require something
approaching $O(2^n)$ evaluations for most other cases of interest.
Shortcuts are available, with less power as GS note. Westfall and
Tobias (2007) use a tree-based representation of the $2^n - 1$
hypotheses, along with a branch-and-bound algorithm for obtaining
conservative, but computationally simpler analyses. These methods are
available in a wide variety of SAS/STAT procedures as of version 9.2
of SAS.
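
To make the contrast between the Bonferroni-Holm shortcut and full closure concrete, here is a small sketch comparing Holm's step-down adjusted p-values with a brute-force pass over all $2^n - 1$ intersection hypotheses, each tested with a local Bonferroni test. The function names and input p-values are hypothetical; the shortcut as written sorts the p-values first, so it is a single pass over the ordered values rather than a literal linear-time routine.

```python
from itertools import combinations


def holm_adjusted(p):
    """Holm step-down adjusted p-values: one pass over the sorted p-values."""
    n = len(p)
    order = sorted(range(n), key=lambda i: p[i])
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, i in enumerate(order):
        # The rank-th smallest p-value is multiplied by the number of
        # hypotheses not yet stepped past, then made monotone.
        running_max = max(running_max, min(1.0, (n - rank) * p[i]))
        adjusted[i] = running_max
    return adjusted


def brute_force_closed_bonferroni(p):
    """Closure with local Bonferroni tests: enumerate every nonempty index
    set J, compute min(1, |J| * min_{j in J} p_j), and take the maximum
    over all J containing each i. Cost grows like 2^n."""
    n = len(p)
    adjusted = [0.0] * n
    for size in range(1, n + 1):
        for J in combinations(range(n), size):
            local = min(1.0, size * min(p[j] for j in J))
            for i in J:
                adjusted[i] = max(adjusted[i], local)
    return adjusted


example_p = [0.01, 0.04, 0.03, 0.20]   # hypothetical elementary p-values
print(holm_adjusted(example_p))
print(brute_force_closed_bonferroni(example_p))
```

Both functions return the same adjusted p-values on this input, but the brute-force version visits every subset and becomes infeasible well before n reaches the thousands of tests mentioned earlier.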

Oddly, GS do not mention Hommel's (1988) $O(n^2)$ closure shortcut
when using Simes' test; this shortcut is essentially identical to the
one mentioned by Zaykin et al. (2002) for the truncated product (and,
as a special case, Fisher combination) test.

2.4 Permutation Tests 

Permutation tests offer, under certain assumptions, 
exact rather than approximate inference. They also 
allow, in the case of binary data, exceptionally higher 
power than corresponding methods based on contin- 
uous data, by utilizing sparseness (Westfall, 2011). 
In addition, tests that assume independence require 
some correction for correlation structure, as would 
be the case for the adverse event data of Table 3 of 
GS. Hence, permutation tests are useful for gaining 
power, as well as for obtaining valid p-values. 

Problems with permutation-based testing include computational
difficulties and hidden assumptions. There is the obvious
computational burden of either enumerating or simulating the
permutation distribution; doing this separately for $O(2^n)$ subsets
is impossible, even for moderate $n$. When the "subset pivotality"
condition of Westfall and Young (1993) is valid, one can use a single
global permutation distribution rather than $2^n - 1$ separate
permutation distributions. The subset pivotality condition is valid
for many multivariate models, but fails for multiple comparisons with
three or more groups, since the global permutation distribution is
not valid for making pairwise comparisons involving two groups. If
the subset pivotality condition is satisfied, and if the (consonant)
MinP tests are used, the computational burden is greatly reduced,
making the Westfall-Young method feasible for large-scale multiple
testing applications.

One must also state one's assumptions about the intersection
hypotheses when doing permutation-based analysis. When using
permutation tests, the simplest form of an elementwise null
hypothesis is that the data are exchangeable between groups. However,
the intersection of exchangeable elementwise hypotheses does not
imply joint exchangeability. For example, consider the two-group
MANOVA with bivariate data. If group one is bivariate normal with
mean vector $\mathbf{0}$ and identity covariance matrix, while group
two has the same mean vector but [...] then the data in variable 1
are exchangeable between the groups [specifically, i.i.d. $N(0,1)$],
the data in variable 2 are exchangeable between the groups [also
i.i.d. $N(0,1)$], but the two-dimensional vectors are not
exchangeable between the groups. Thus, an assumption that marginal
exchangeability implies joint exchangeability is required when
performing permutation-based closed testing with multivariate
multisample data.
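
The distinction between marginal and joint exchangeability can be illustrated with a short simulation. This is a sketch under a loudly stated assumption: the covariance matrix of group two is not recoverable from the text above, so the code uses unit variances with a hypothetical correlation `rho = 0.9`, which keeps both margins standard normal. The function names, sample size and seed are likewise illustrative.

```python
import numpy as np


def simulate_groups(n_per_group=5000, rho=0.9, seed=0):
    """Two bivariate normal samples with identical N(0, 1) margins.
    Group 1 has identity covariance; group 2 has correlation rho
    (rho is an assumed value, not taken from the source)."""
    rng = np.random.default_rng(seed)
    mean = [0.0, 0.0]
    g1 = rng.multivariate_normal(mean, [[1.0, 0.0], [0.0, 1.0]], size=n_per_group)
    g2 = rng.multivariate_normal(mean, [[1.0, rho], [rho, 1.0]], size=n_per_group)
    return g1, g2


def summarize(g1, g2):
    """Per-variable standard deviations and within-group correlations."""
    return {
        "sd_group1": g1.std(axis=0, ddof=1).tolist(),
        "sd_group2": g2.std(axis=0, ddof=1).tolist(),
        "corr_group1": float(np.corrcoef(g1.T)[0, 1]),
        "corr_group2": float(np.corrcoef(g2.T)[0, 1]),
    }


g1, g2 = simulate_groups()
for name, value in summarize(g1, g2).items():
    print(name, value)
```

Each variable taken alone looks the same in both groups, so swapping group labels one variable at a time changes nothing, while the within-group correlations differ sharply, so swapping whole bivariate vectors does change the joint distribution.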

On the other hand, with consonant Bonferroni- 
based closed permutation procedures, one can dis- 
pense with such assumptions. These methods are 
computationally simple, control the FWER for all 
sample sizes, and retain the power advantage asso- 
ciated with permutation tests; details are given by 
Westfall and Troendle (2008). 